Appendix B — Assignment B

Inference & Correlation

Instructions

  1. You may talk to a friend and discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, not as a group activity.

  2. Use R code chunks to insert code, and type your answers outside the code chunks. Ensure that the solution is written neatly enough to understand and grade.

  3. Render the file as HTML to submit. For theoretical questions, you can either type the answer and include the solution in this file, or write the solution on paper, scan it, and submit it separately.

  4. The assignment is worth 100 points, and is due on 14th October 2023 at 11:59 pm.

  5. Five points are for properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (the theory part may be scanned and submitted separately) (2 pts).
  • There are no excessively long outputs of extraneous information (e.g., no printouts of entire data frames without good reason, no long printouts of which iteration a loop is on, no long sections of commented-out code, etc.), and there is no unnecessary or redundant code or text (1 pt).
  • Final answers of each question are written clearly (1 pt).
  • The proofs are legible, and clearly written with reasoning provided for every step. They are easy to follow and understand (1 pt)

All questions are based on the normal linear regression model below, unless otherwise stated:

\(Y_i = \beta_0 + \beta_1X_i + \epsilon_i,\ i = 1,\dots,n\), where \(\epsilon_i \sim N(0, \sigma^2)\) ... (1)

B.1

The dataset crime.txt consists of the number of crimes per 100,000 residents and the percentage of individuals having a high school diploma in each of 84 counties.

B.1.1

Estimate the expected decrease in the number of crimes per 100,000 residents when the percentage of individuals having a high school diploma increases by 1%. Use a 99% confidence interval. Interpret the interval estimate.

(4 + 2 = 6 points)
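A minimal sketch of one way to set this up in R is below (the column names crime_rate and hs_pct are assumptions; replace them with the actual column names in crime.txt):

#Reading the data and fitting the simple linear regression model
crime <- read.table("crime.txt", header = TRUE)
model_crime <- lm(crime_rate ~ hs_pct, data = crime)

#99% confidence interval for the slope
confint(model_crime, "hs_pct", level = 0.99)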

B.1.2

For what percentage of individuals having a high school diploma in a county will you be the most confident in estimating the crime rate with a confidence interval of a given width?

(2 points)

B.1.3

For what range of values of the percentage of individuals having a high school diploma in a county will it be inappropriate to use the developed model (in B.1.1) to predict the crime rate?

(2 points)

B.1.4

Predict the number of crimes per 100,000 residents of a county using a 90% prediction interval if 75% of the individuals have a high school diploma in the county. Interpret the prediction interval.

(2 + 2 = 4 points)
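A minimal sketch of the prediction step, reusing model_crime from B.1.1 (the column name hs_pct is an assumption):

#90% prediction interval when 75% of individuals have a high school diploma
predict(model_crime, newdata = data.frame(hs_pct = 75),
        interval = "prediction", level = 0.90)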

B.1.5

A consultant stated that the expected number of crimes per 100,000 residents should decrease by at least 1000 for a 4% increase in the percentage of people having high school diplomas. Conduct a hypothesis test to verify the statement. Consider the probability of a Type I error to be 5%. State the null and alternative hypotheses, the p-value, and the conclusion from the test.

(2 + 2 + 2 = 6 points)
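Note that summary() tests the slope against zero, so the test statistic against the claimed value has to be built by hand. A sketch is below (it reuses model_crime and the assumed column name hs_pct; a decrease of at least 1000 per 4% increase corresponds to a slope of at most -250 per 1% increase, and the tail used for the p-value depends on how you set up the hypotheses):

#Test statistic for the null value beta_1 = -250 (not 0)
b1_hat <- coef(summary(model_crime))["hs_pct", "Estimate"]
se_b1 <- coef(summary(model_crime))["hs_pct", "Std. Error"]
t_stat <- (b1_hat - (-250))/se_b1
df_res <- df.residual(model_crime)

#One-sided p-value (upper tail shown; use the lower tail if your
#alternative points the other way)
p_val <- pt(t_stat, df_res, lower.tail = FALSE)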

B.1.6

What is the probability that you will reject the null hypothesis in the previous question if it is actually false and the true value of \(\beta_1\) is \(-200\)?

(7 points)
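One way to approximate this power in R is through the noncentral t distribution; the sketch below reuses se_b1 and df_res from the B.1.5 sketch and assumes the consultant's claim was taken as the null in B.1.5 (\(H_0: \beta_1 \le -250\) vs \(H_a: \beta_1 > -250\)), so that the null is false at \(\beta_1 = -200\). Flip the tail if your hypotheses were set up the other way.

#Noncentrality parameter under the true slope beta_1 = -200
delta <- (-200 - (-250))/se_b1

#Upper-tail critical value at alpha = 0.05 and the resulting power
t_crit <- qt(0.95, df_res)
power <- pt(t_crit, df_res, ncp = delta, lower.tail = FALSE)
power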

B.1.7

By how much is the total variation in crime rate reduced when the percentage of high school graduates is introduced into the analysis?

(2 points)
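The ANOVA table of the model fitted in B.1.1 contains the relevant sums of squares; a one-line sketch (reusing model_crime):

#Regression and error sums of squares for the fitted model
anova(model_crime)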

B.2

Suppose that the normal error regression model is applicable except that the error variance is not constant; rather, the variance is larger the larger \(X\) is. Does \(\beta_1=0\) still imply that there is no linear association between \(X\) and \(Y\)? Explain.

Does it also imply that there is no association between \(X\) and \(Y\), or does it imply that there is an association between \(X\) and \(Y\)? Explain.

(2 + 2 + 2 = 6 points)

B.3

Show that the regression sum of squares has only one degree of freedom for model (1). You may use the normal equations obtained from minimizing the sum of squared errors.

(4 points)

B.4

In a small-scale regression study, five observations on \(Y\) were obtained corresponding to \(X=1, 4, 10, 11,\) and \(14\). Assume that \(\sigma = 0.6, \beta_0 = 5,\) and \(\beta_1=3\).

B.4.1

What are the expected values of MSR and MSE?

(3 + 2 = 5 points)

B.4.2

For determining whether or not a regression relation exists, would it have been better or worse to have made the five observations at \(X=6,7,8,9,10\)? Why? Would the same answer apply if the principal purpose were to estimate the mean response for \(X=8\)? Explain.

(3 + 3 = 6 points)
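A quick numerical comparison of the two designs may help; the sketch below computes \(S_{xx} = \sum(X_i - \bar{X})^2\), which drives the sampling variance of \(\hat{\beta}_1\), for each set of observation points:

#Spread of X for the two candidate designs
x_a <- c(1, 4, 10, 11, 14)
x_b <- c(6, 7, 8, 9, 10)
sum((x_a - mean(x_a))^2)
sum((x_b - mean(x_b))^2)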

B.5

Consider the following code and its output below.

#Setting seed for reproducibility
set.seed(10)

#Simulating data
x <- seq(0, 1, 0.01)
y <- 0.5 + 2*x + rnorm(length(x))
data <- data.frame(x = x, y = y)

#Developing linear regression model
model <- lm(y ~ x, data = data)
summary(model)

Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8866 -0.6294  0.0103  0.6819  2.2644 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.08838    0.18370   0.481    0.631    
x            2.53776    0.31738   7.996 2.45e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9299 on 99 degrees of freedom
Multiple R-squared:  0.3924,    Adjusted R-squared:  0.3863 
F-statistic: 63.94 on 1 and 99 DF,  p-value: 2.448e-12

We see that the intercept is statistically insignificant. Since it is insignificant, a student suggested that it will not matter if it is removed from the model, i.e., if the model is changed to:

\(Y_i = \beta_1X_i + \epsilon_i,\ i = 1,\dots,n\), where \(\epsilon_i \sim N(0, \sigma^2)\) ... (2)

B.5.1

Given that \(\beta_0 \ne 0\), derive the expression for the bias in the estimate of \(\beta_1\) if model (2) is considered.

(4 points)

B.5.2

Use the simulated data to compute the value of the bias obtained from the expression derived in the previous question (B.5.1).

(2 points)

B.5.3

Use simulations to verify the value of bias computed in the previous question (B.5.2). Simulate the data 1000 times. Set a unique value of seed to generate a unique dataset in every iteration. In each iteration, estimate \(\beta_1\). Report the bias in the estimate of \(\beta_1\) based on the 1000 simulations. Is it the same as obtained analytically in the previous question (B.5.2)?

Note that: \(Bias(\hat{\beta}_1) = E(\hat{\beta}_1 - \beta_1)\)

(5 + 1 = 6 points)
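A minimal simulation sketch that follows the stated procedure (a unique seed per iteration, the same data-generating process as in the given code, and a no-intercept fit each time):

#Simulating 1000 datasets and estimating beta_1 from the no-intercept model
n_sim <- 1000
b1_noint <- numeric(n_sim)
for (i in 1:n_sim) {
  set.seed(i)
  y_sim <- 0.5 + 2*x + rnorm(length(x))
  b1_noint[i] <- coef(lm(y_sim ~ x - 1))
}

#Bias relative to the true value beta_1 = 2
mean(b1_noint) - 2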

B.5.4

Plot the 1000 regression lines based on the 1000 coefficient estimates of the no-intercept model (as obtained in B.5.3) over a scatterplot of the data points.

(2 points)
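One possible base-R sketch, reusing the estimates b1_noint and n_sim from the B.5.3 sketch and the data frame data from the given code (each no-intercept line passes through the origin):

#Scatterplot of the data with the 1000 fitted no-intercept lines
plot(data$x, data$y, pch = 16, col = "grey60", xlab = "x", ylab = "y")
for (i in 1:n_sim) abline(a = 0, b = b1_noint[i], col = rgb(0, 0, 1, alpha = 0.05))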

B.5.5

Now consider the simple linear regression model with the intercept (model 1). Simulate the data 1000 times. In each iteration, estimate \(\beta_1\). Report the bias in the estimate of \(\beta_1\) based on the 1000 simulations.

(4 points)

B.5.6

Plot the 1000 regression lines based on the 1000 coefficient estimates of the model with intercept (as obtained in B.5.5) over a scatterplot of the data points.

(2 points)

B.5.7

Based on the simulations and analytical results, answer the following:

B.5.7.1

If the intercept is found to be statistically insignificant in model (1), as shown in the model summary, should it be removed from the model, and model (2) considered instead? Explain.

(3 points)

B.5.7.2

Suppose we know that the true model passes through the origin \((X = 0, Y = 0)\). Should we consider model (1) or model (2) in this case? Explain.

(2 points)

B.5.8

In terms of bias and variance, mention an advantage and a disadvantage of the estimate of \(\beta_1\) from model (2) as compared to that obtained from model (1), if \(\beta_0 \ne 0\). The plots developed in earlier questions may also help visualize the advantage and disadvantage.

(2 + 2 = 4 points)

B.6

Prove that the estimate of the intercept \(\beta_0\), i.e., \(\hat{\beta}_0\) for model (1) has minimum variance among all unbiased linear estimators.

(5 points)

B.7

Derive the variance of the estimators obtained in questions A.8 and A.9 of Assignment A.

(2 + 2 = 4 points)

B.8

Suppose the true value of \(\beta_0\) is known, while that of \(\beta_1\) is unknown and needs to be estimated. The stakeholder wishes to choose the model (from model (1) and model (2)) such that the expected squared error in the estimate of \(\beta_1\) is minimized, i.e., they wish to select the model that has the smaller value of \(E(\hat{\beta}_1 - \beta_1)^2\). Derive the condition (based on the value of \(\beta_0\)) which, if true, will imply that model (2) should be chosen instead of model (1).

Hint:

\(E(X^2) = [E(X)]^2 + Var(X)\)

\(\implies E(\hat{\beta}_1 - \beta_1)^2 = [E(\hat{\beta}_1 - \beta_1)]^2 + Var(\hat{\beta}_1 - \beta_1)\)

Model (2) should be selected when \(E(\hat{\beta}_1 - \beta_1)^2\) for model (2) is smaller than \(E(\hat{\beta}_1 - \beta_1)^2\) for model (1).

(7 points)